236 PART 5 Looking for Relationships with Correlation and Regression
Executing a Multiple Regression Analysis in Software
Before executing your multiple regression analysis, you may need to do some prep
work on the variables you intend to include in your model. In the following
sections, we explain how to handle the categorical variables you plan to include,
and we show you how to examine these variables by making several charts before
you run your analysis. If you need guidance on what variables to consider for your
models, read Chapter 20.
Preparing categorical variables
The predictors in a multiple regression model can be either numerical or
categorical (Chapter 8 discusses the different types of data). In a categorical
variable, each category is called a level. If a variable, like Setting, can have
only two levels, like Inpatient or Outpatient, then it’s called a dichotomous or a
binary categorical variable. If it can have more than two levels, it’s called a
multilevel variable.
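As a quick check before modeling, you can count the distinct levels of each categorical variable to see whether it is binary or multilevel. The following is a minimal sketch in plain Python. The function name `classify_levels` and the sample values are made up for illustration; they are not from any particular statistics package.

```python
def classify_levels(values):
    """Label a categorical variable by how many distinct levels it has."""
    levels = sorted(set(values))
    if len(levels) == 2:
        kind = "binary"
    elif len(levels) > 2:
        kind = "multilevel"
    else:
        kind = "fewer than two levels"  # nothing to compare across levels
    return kind, levels


# Hypothetical sample rows, one value per participant.
setting = ["Inpatient", "Outpatient", "Outpatient", "Inpatient"]
diagnosis = ["Hypertension", "Diabetes", "Cancer", "Other", "Diabetes"]

setting_kind, setting_levels = classify_levels(setting)
diagnosis_kind, diagnosis_levels = classify_levels(diagnosis)

print("Setting:", setting_kind, setting_levels)
print("Primary Diagnosis:", diagnosis_kind, diagnosis_levels)
```

Note that this counts only the levels that actually appear in your data, so a level that is defined but has no rows will not show up.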
Figuring out the best way to introduce categorical predictors into a multiple
regression model can be challenging. You have to set up your data the right
way, or you’ll get results that are either wrong or difficult to interpret properly.
Following are two important factors to consider.
Having enough participants in each level of each categorical variable
Before using a categorical variable in a multiple regression model, you should
tabulate how many participants (or rows) are included in each level. If you have
any sparse levels (levels with row frequencies in the single digits), you will
want to consider collapsing them into others. Usually, the more evenly the rows
are distributed across the levels, and the fewer levels there are, the more
precise and reliable the results. If a level doesn’t contain enough rows, the
program may ignore that level, halt with a warning message, produce incorrect
results, or crash. Worse, if it does produce results, they may be impossible to
interpret.
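The tabulate-then-collapse check described above can be sketched in a few lines of plain Python. This is only one way to do it, under stated assumptions: the helper names (`tabulate_levels`, `sparse_levels`, `collapse`) are hypothetical, the threshold of 10 is our reading of "single digits," and the data are made-up rows whose counts mirror the Primary Diagnosis example that follows. Your statistical software probably has its own frequency-table and recoding commands.

```python
from collections import Counter

# Hypothetical data: one Primary Diagnosis value per participant (row).
diagnoses = (
    ["Hypertension"] * 73 + ["Diabetes"] * 35 + ["Cancer"] * 1 + ["Other"] * 10
)

SPARSE_THRESHOLD = 10  # assumption: "single digits" means fewer than 10 rows


def tabulate_levels(values):
    """Return a one-way frequency table: level -> number of rows."""
    return Counter(values)


def sparse_levels(freq, threshold=SPARSE_THRESHOLD):
    """Return the levels whose row count falls below the threshold."""
    return sorted(level for level, n in freq.items() if n < threshold)


def collapse(values, levels_to_merge, into):
    """Recode every value found in levels_to_merge as the level `into`."""
    merge = set(levels_to_merge)
    return [into if value in merge else value for value in values]


freq = tabulate_levels(diagnoses)
flagged = sparse_levels(freq)

# Build a NEW variable rather than overwriting the original one.
collapsed = collapse(diagnoses, flagged, into="Other")
collapsed_freq = tabulate_levels(collapsed)

for level, n in freq.most_common():
    print(f"{level}: {n}")
print("Sparse levels:", flagged)
print("After collapsing:", dict(collapsed_freq))
```

Keeping the original variable and writing the collapsed version to a new one lets you try a different grouping later without going back to the raw data.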
Imagine that you create a one-way frequency table of a Primary Diagnosis variable
from a sample of study participant data. Your results are: Hypertension: 73,
Diabetes: 35, Cancer: 1, and Other: 10. To deal with the sparse Cancer level, you
may want to create another variable in which Cancer is collapsed together with
Other (which would then have 11 rows). Another approach is to create a binary
variable with yes/no levels, such as: Hypertension: 73 and No Hypertension: 46.
But binary variables don’t take into account the other levels. You could also make